Metadata extraction and text categorization using Universal Resource Locator expansions
Authors
Abstract
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can indicate metadata about a resource. This paper explores the mining of URLs to yield categoric metadata about web resources via a three-phase pipeline of word segmentation, abbreviation expansion and classification. I apply this approach to the problem of subject metadata generation and quantify its performance relative to title- and document-based methods, both of which require the retrieval of the source document.
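The three phases named in the abstract (word segmentation, abbreviation expansion, classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the lexicon, the abbreviation table or the classifier, so `LEXICON`, `ABBREVIATIONS`, `CATEGORY_KEYWORDS` and all function names are hypothetical, and the keyword-count classifier is only a stand-in for whatever model the paper uses.

```python
import re
from urllib.parse import urlparse

# Hypothetical toy resources; the paper's real lexicon, abbreviation
# table and classifier are not described in the abstract.
LEXICON = {"math", "dept", "intro", "biology", "bio", "lab", "computer", "science"}
ABBREVIATIONS = {"dept": "department", "intro": "introduction",
                 "bio": "biology", "cs": "computer science"}
CATEGORY_KEYWORDS = {
    "mathematics": {"math", "mathematics", "algebra"},
    "biology": {"biology", "genetics", "lab"},
    "computing": {"computer", "software"},
}

def tokenize(url):
    """Split the host and path of a URL into lowercase alphabetic tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z]+", raw.lower()) if t]

def segment(token):
    """Phase 1: split a token into the fewest lexicon words (dynamic programming).

    Returns [token] unchanged when no full segmentation exists.
    """
    n = len(token)
    best = [None] * (n + 1)  # best[i] = shortest segmentation of token[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in LEXICON:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [token]

def expand(words):
    """Phase 2: replace known abbreviations with their expansions."""
    expanded = []
    for word in words:
        expanded.extend(ABBREVIATIONS.get(word, word).split())
    return expanded

def classify(words):
    """Phase 3 (stand-in): pick the category with the most keyword hits."""
    scores = {cat: sum(w in kws for w in words)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def url_to_category(url):
    """Run the three phases on a URL without fetching the document."""
    words = []
    for token in tokenize(url):
        words.extend(segment(token))
    return classify(expand(words))
```

For example, under these toy tables `url_to_category("http://example.org/introbiology/lab")` segments `introbiology` into `intro` and `biology`, expands `intro` to `introduction`, and assigns the `biology` category, all without retrieving the page itself.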
Similar resources
Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework
With the increasing amount of web pages over the internet, it has been a major concern to obtain information on the internet accurately at a reasonable cost with decent performance. A potential solution is through the classification of web pages into meaningful categories. An effective classification of web pages is of benefit to various applications such as web mining and search engines. Unlik...
Categorizing Learning Objects Based On Wikipedia as Substitute Corpus
As metadata is often not sufficiently provided by authors of Learning Resources, automatic metadata generation methods are used to create metadata afterwards. One kind of metadata is categorization, particularly the partition of Learning Resources into distinct subject categories. A disadvantage of state-of-the-art categorization methods is that they require corpora of sample Learning Resources...
Guest Editorial on Metadata
This is a special issue on the topic of metadata. An often-cited definition of metadata is `data about data'. In most cases, this means data that describe documents, for example, the author of a document, the date that a photograph was taken, or the Universal Resource Locator (URL) of a Web site. The World-Wide Web Consortium defines metadata as `machine understandable information for the Web' <h...
Metadata for electronic information resources: From variety to interoperability
Metadata serves several purposes. It supports resource discovery, locates the actual digital resource by inclusion of a digital identifier, organizes electronic resources bringing similar resources together and distinguishing dissimilar resources, provides administrative information for controlling the digital library, and provides technical, preservation and rights management information neede...